Preprocessing the dataset

MAIN DATA SET

CollegeDataSet <- college %>%
  select(INSTNM,REGION,IND_INC_AVG,GRAD_DEBT_MDN,CIP14BACHL,CIP54BACHL,C150_4,TUITIONFEE_IN,TUITIONFEE_OUT,C100_4_POOLED,RELAFFIL)

David Bartling

data1 <- CollegeDataSet %>%
  select(GRAD_DEBT_MDN, REGION)

data2 <- data1 %>%
  filter(REGION == 5 || REGION == 6) %>%
  mutate(
    REGION = recode(
      REGION,
      `5` = "se",
      `6` = "sw"))
## Warning: Unreplaced values treated as NA as .x is not compatible. Please
## specify replacements exhaustively or supply .default
data3 <- data2 %>%
  na.omit(cols = GRAD_DEBT_MDN, invert = FALSE)

Quan Tran

college_data_tran1 <- CollegeDataSet %>%
  select (CIP14BACHL,CIP54BACHL,C150_4)

Fahd Saleem

Student: Fahd Saleem (G01154081)

Question: What is the disparity between various religious affiliations of institutions and the average family income of independent students compared to those of non-religious institutions. Does the religious affiliation of an institution imply students being of a higher socioeconomic class?

Variables: IND_INC_AVG, RELAFFIL

  • IND_INC_AVG: Independent Income Average - The average income for independent students at an institution.
  • RELAFFIL: Religious Affiliation - The religious affiliation of an institution.

Preview: My question is in regards to the various university’s and their religious affiliation, I will be first showing the various religious institutions and how they vary in terms of an independent student’s average family income, I will also be showing the disparity between non religious institutions and religious institutions in terms of their average family income. This will be done through conducting a summary statistic, cleaning the data in attempts to facilitate the users’ view of the data in an interactive visualization, and lastly conduct a hypothesis test in which the question of whether or not there is a difference or disparity between the average family incomes will be answered!

  • The following portion satisfies the summary statistic requirement.

  • The code below serves the purpose of getting the summary statistics for non-religious institutions. Firstly as shown the dataset used is the college dataset which and the output of the following code is set to hasnoreligion in order to store the output, which will be used later on. Moving on, a filter is used to get the schools that have a religious affiliation of NA which is essentially all the institutions who have don’t have a religious affiliation. Moreover, a summarise function has been used in order to get the summary statistics such as the average, minimum, and the maximum for all institutions which are of a non-religious affiliation. In the output, we can see that the minimum value is seen to be zero; this could mean that this data might not be reliable, possibly due to a wrong entry in the original set. The summary statistics are used for the variable IND_INC_AVG, which is the independent income average for independent students. Overall this goes to show the statistics for non-religious institutions as labeled within the output.

##SUMMARY FOR INSTITUTIONS OF NON RELIGIOUS AFFILIATION
hasnoreli<-CollegeDataSet%>%
  filter(is.na(RELAFFIL))%>%
  summarise(
    RELAFFIL = "Non-Religious Institutions",
    avg = round(mean(IND_INC_AVG,na.rm = TRUE)),
    min = min(IND_INC_AVG,na.rm = TRUE),
    max = max(IND_INC_AVG,na.rm = TRUE),
    med = median(IND_INC_AVG,na.rm = TRUE), 
    SD = sd(IND_INC_AVG,na.rm = TRUE), 
    IQR = IQR(IND_INC_AVG,na.rm = TRUE)
  ) 
head(hasnoreli)
RELAFFIL avg min max med SD IQR
Non-Religious Institutions 20275 0 67438.15 19466.64 7549.05 8571.052
  • The following code below is used to get the summary statistics similar to the code above. However, rather than getting the summary statistics for Institutions that have no religious affiliation with this code chunk will get the statistics for schools that do have a religious affiliation. Firstly the college dataset is used, and the output of the code will be saved to the data entity called hasreli, meaning that the institutions have a religious affiliation. Next, we use a filter function, the filter function is checking for values which do have a valid numeric value, necessarily applicable values. The not symbol ! is used to ensure we are filtering for values that are not NA or null for the column RELAAFFIL. Next, a summarise function is used to get the summary statistics for religious institutions. The statistics consist of the average, minimum, and maximum for the independent average income of students who attend religious institutions.
##SUMMARY FOR INSTITUTIONS OF RELIGIOUS AFFILIATION
hasreli<- CollegeDataSet%>%
  filter(!is.na(RELAFFIL))%>%
  summarise(
    RELAFFIL = "Religious Institutions",
    avg = round(mean(IND_INC_AVG,na.rm = TRUE)),
    min = min(IND_INC_AVG,na.rm = TRUE),
    max = max(IND_INC_AVG,na.rm = TRUE),
    med = median(IND_INC_AVG,na.rm = TRUE), 
    SD = sd(IND_INC_AVG,na.rm = TRUE), 
    IQR = IQR(IND_INC_AVG,na.rm = TRUE)
  )
head(hasreli)
RELAFFIL avg min max med SD IQR
Religious Institutions 25824 557.6104 55639.34 26464.04 11532.98 17437.88
  • The code below is used to obtain the summary statistics for each individual institution that has a religious affiliation. Unlike the codes above, which get the summary statistics of all institutions categorized as a single row output, this code gets the statistics for each institution of religious affiliation. In order to achieve this, the group_by function was used in conjunction with the filter function. This allows for all the RELAFFIL or religious affiliations to be filtered so that only religious affiliations that are not NA are shown. I did this because RELAFFIL has its own specific value for null values, which corresponds to a -2 or -1; therefor any values which are in actuality null are considered non-religious affiliations. As a result, they are filtered out since I want to get only the institutions of religious affiliations. Next, I have grouped them by RELAFFIL, allowing for each individual type of religious affiliation of an institution to be shown. If you see the output it is prominent that all the values under the RELAFFIL column are numerical values, how will the user possibly know which religious affiliation is which? I took this into significant consideration, as shown in within upcoming code. In some forthcoming code, I will recode all the values to make it possible for viewers of the data to see the actual religious affiliation as a string rather than an uninterpretable numerical value. Back to this code chunk, I go on to conduct a summary statistic the same as before; lastly, the code is filtered of non-applicable values just as a precaution. Within the output, all individual institutions of religious affiliation are shown and its corresponding summary statistic.
##SUMMARY FOR EACH INDIVIDUAL RELIGIOUS AFFILIATION INSTITUTION
summary <- CollegeDataSet%>%
  filter(!is.na(RELAFFIL))%>%
  group_by(RELAFFIL) %>%
  summarise(
    avg = round(mean(IND_INC_AVG, na.rm = TRUE)),
    min = min(IND_INC_AVG, na.rm = TRUE),
    max = max(IND_INC_AVG, na.rm = TRUE),
    med = median(IND_INC_AVG,na.rm = TRUE), 
    SD = sd(IND_INC_AVG,na.rm = TRUE), 
    IQR = IQR(IND_INC_AVG,na.rm = TRUE)
  )%>%
filter(!is.na(avg))
head(summary)
RELAFFIL avg min max med SD IQR
22 28170 28170.111 28170.11 28170.111 NA 0.000
24 9987 7244.028 12730.69 9987.359 3879.657 2743.332
27 24151 13531.000 32716.35 26456.557 6712.750 9718.246
28 25369 10408.224 40329.69 25368.957 21157.671 14960.732
30 29045 4820.895 54766.45 30081.574 10567.190 11919.131
33 28901 13794.636 44006.67 28900.653 21363.133 15106.017
  • Unlike the previous codes, this code gets the summary statistics for all institutions regardless of religious affiliation. They are then categorized as either having a religious affiliation by the value of “True” or not having a religious affiliation categorized by “False.” This dataset is crucial as it is the basis for being used in the hypothesis test, in which we will see if there is a disparity in the averages between the groups, & if it implies a higher socioeconomic standing for those of a religious institution. The code below also contains a mutate function to recode some of the religious affiliation values, as previously mentioned, the values True and False are used. Originally without the mutate, we are shown NA values for the institutions with no religious affiliation. The recode recodes the non-religious with False. Next, the remaining default is replaced with True.
##SUMMARY FOR ALL SCHOOLS AND THEIR AFFILIATIONS
summaryall <- CollegeDataSet%>%
  group_by(INSTNM,RELAFFIL) %>%
  summarise(
    avg = round(mean(IND_INC_AVG, na.rm = TRUE)),
    min = min(IND_INC_AVG, na.rm = TRUE),
    max = max(IND_INC_AVG, na.rm = TRUE),
    med = median(IND_INC_AVG,na.rm = TRUE), 
    SD = sd(IND_INC_AVG,na.rm = TRUE), 
    IQR = IQR(IND_INC_AVG,na.rm = TRUE)
  )%>%
  mutate(
    RELAFFIL = recode(
      RELAFFIL,
      .missing =  "False",
      .default = "True"
))%>%
filter(!is.na(avg))
head(summaryall)
INSTNM RELAFFIL avg min max med SD IQR
Aaniiih Nakoda College False 9170 9169.706 9169.706 9169.706 NA 0
Abcott Institute False 10770 10770.245 10770.245 10770.245 NA 0
Abdill Career College Inc False 16590 16589.949 16589.949 16589.949 NA 0
Abilene Christian University True 18267 18267.444 18267.444 18267.444 NA 0
Abraham Baldwin Agricultural College False 25314 25313.870 25313.870 25313.870 NA 0
Academia Serrant Inc False 4524 4523.507 4523.507 4523.507 NA 0
  • The following code is used to recode the numerical values for RELAFFIL, which represents the various religious affiliations. This is necessary to do since the proceeding code will incorporate an interactive visual, and if the user is looking at the visual, all the numerical religious affiliations won’t make any sense unless a very long key is provided. However, I did not want to make the user experience bad and went through the excruciating process to recode all the values for religious affiliations so that the end-user could actually know what data they are looking at. This essentially allows for the values which were originally numeric to be replaced with the string value I have assigned it. Overall this was intended to immensely facilitate the end-user experience & allows the user viewing my visualization to know which religious affiliation corresponds with what average income. I’ve included head(updat_summary), which will enable you to see how each religious affiliation now has a name associated with it. Going back to the code head(summary), you can see RELAFFIL has only numeric values associated with it, making it almost impossible to know what religious affiliation is what.
updat_summary<-summary%>%
mutate(
    RELAFFIL = recode(
      RELAFFIL,
      .default = "Not reported",
      `-1`=  "Not reported",
      `-2`=  "Not applicable",
      `22`=  "American Evangelical Lutheran Church",            
      `24`=  "African Methodist Episcopal Zion Church",         
      `27`=  "Assemblies of God Church",            
      `28`=  "Brethren Church",         
      `30`=  "Roman Catholic",          
      `33`=  "Wisconsin Evangelical Lutheran Synod",            
      `34`=  "Christ and Missionary Alliance Church",           
      `35`=  "Christian Reformed Church",           
      `36`=  "Evangelical Congregational Church",           
      `37`=  "Evangelical Covenant Church of America",          
      `38`=  "Evangelical Free Church of America",          
      `39`=  "Evangelical Lutheran Church",         
      `40`=  "International United Pentecostal Church",         
      `41`=  "Free Will Baptist Church",            
      `42`=  "Interdenominational",         
      `43`=  "Mennonite Brethren Church",           
      `44`=  "Moravian Church",         
      `45`=  "North American Baptist",          
      `47`=  "Pentecostal Holiness Church",         
      `48`=  "Christian Churches and Churches of Christ",           
      `49`=  "Reformed Church in America",          
      `50`=  "Episcopal Church, Reformed",          
      `51`=  "African Methodist Episcopal",         
      `52`=  "American Baptist",            
      `53`=  "American Lutheran",           
      `54`=  "Baptist",         
      `55`=  "Christian Methodist Episcopal",
      `57`=  "Church of God",           
      `58`=  "Church of Brethren",          
      `59`=  "Church of the Nazarene",          
      `60`=  "Cumberland Presbyterian",         
      `61`=  "Christian Church (Disciples of Christ)",          
      `64`=  "Free Methodist",          
      `65`=  "Friends",         
      `66`=  "Presbyterian Church (USA)",           
      `67`=  "Lutheran Church in America",          
      `68`=  "Lutheran Church - Missouri Synod",            
      `69`=  "Mennonite Church",            
      `71`=  "United Methodist",            
      `73`=  "Protestant Episcopal",            
      `74`=  "Churches of Christ",          
      `75`=  "Southern Baptist",            
      `76`=  "United Church of Christ",         
      `77`=  "Protestant, not specified",           
      `78`=  "Multiple Protestant Denomination",            
      `79`=  "Other Protestant",            
      `80`=  "Jewish",          
      `81`=  "Reformed Presbyterian Church",            
      `84`=  "United Brethren Church",          
      `87`=  "Missionary Church Inc",           
      `88`=  "Undenominational",            
      `89`=  "Wesleyan",            
      `91`=  "Greek Orthodox",          
      `92`=  "Russian Orthodox",            
      `93`=  "Unitarian Universalist",          
      `94`=  "Latter Day Saints (Mormon Church)",           
      `95`=  "Seventh Day Adventists",          
      `97`=  "The Presbyterian Church in America",          
      `99`=  "Other (none of the above)",           
      `100`= "Original Free Will Baptist",          
      `101`= "Ecumenical Christian",            
      `102`= "Evangelical Christian",           
      `103`= "Presbyterian",            
      `105`= "General Baptist",         
      `106`= "Muslim",          
      `107`= "Plymouth Brethren"
    )
)
head(updat_summary)
RELAFFIL avg min max med SD IQR
American Evangelical Lutheran Church 28170 28170.111 28170.11 28170.111 NA 0.000
African Methodist Episcopal Zion Church 9987 7244.028 12730.69 9987.359 3879.657 2743.332
Assemblies of God Church 24151 13531.000 32716.35 26456.557 6712.750 9718.246
Brethren Church 25369 10408.224 40329.69 25368.957 21157.671 14960.732
Roman Catholic 29045 4820.895 54766.45 30081.574 10567.190 11919.131
Wisconsin Evangelical Lutheran Synod 28901 13794.636 44006.67 28900.653 21363.133 15106.017
  • The following portion satisfies the data visualization requirement.

  • The following code binds the updat_summary, hasreli, and hasnoreli into a new entity called summary_all. This data will be further used in a visualization to allow the user to compare the various averages for different religious institutions, and make comparisons with the overall averages for both the non-religious institutions as well as the overall average for religious institutions.

summary_all <- rbind(updat_summary,hasreli, hasnoreli)
  • The following visual shows all the various types of religious affiliations and the average income of independent students associated with that particular religious institution. This visualization is intended to show the differences among all the types of religious institutions. Most importantly, the right end of the visual contains two bars, one of which represents all religious institutions and the other, which represents all non-religious institutions. This will allow for viewers to gauge the disparity between both when compared to the various different types of religious institutions. For a more prominent view of the last two bars on the right, the next visualization will have them completely isolated from the rest for further and more detailed inspection. This visualization also has taken into account for the naming of all institutional religious affiliations as the original dataset categorized all institutional religious affiliations with a numerical datatype corresponding with a categorical name. All in all, this is a unique and interactive way to view the data, allowing for viewers to come to their conclusions and mess around with the interactive features such as hovering over the different bars to see its corresponding information.
summary_all%>%
  hchart(
    "column",
    hcaes(
      x = RELAFFIL,
      y = avg
    )
  ) %>%
  hc_tooltip(
    useHTML = TRUE,
    pointFormat = tooltip_table(
      x = combine("Religious Affiliation:", "Average Family income of independent students: $"),
      y = combine("{point.RELAFFIL}", "{point.avg}")
    )
  )
## Warning: `parse_quosure()` is deprecated as of rlang 0.2.0.
## Please use `parse_quo()` instead.
## This warning is displayed once per session.
  • The code below simply uses the rbind function to bind the hasreli and the hasnoreli table contents together. The purpose of this code is used for the code below, which shows the difference between religious and non-religious institutions in terms of the average income of independent students.
summary_both <- rbind(hasreli, hasnoreli)
  • The following visualization is to show the difference in the averages of independent student income based on the group of the institution Either religiously affiliated or non-religiously affiliated institutions. This visualization offers an interactive component intended to facilitate the users viewing experience & overall has a charming aesthetical appeal. Most importantly, from this visualization, we can see the difference between religious and non-religious institutions in terms of average independent student income. This can show that there indeed is a difference between an institution being religiously affiliated or not. We can see that the average for religious institutions is much higher than non-religious, possibly implying that students of religious institutions are of higher socioeconomic standing. Later on, a proper hypothesis test will be conducted in order to see if this is indeed the case.
summary_both%>%
  hchart(
    "column",
    hcaes(
      x = RELAFFIL,
      y = avg
    )
  ) %>%
  hc_tooltip(
    useHTML = TRUE,
    pointFormat = tooltip_table(
      x = combine("Religious Affiliation:", "Average Family income of independent students: $"),
      y = combine("{point.RELAFFIL}", "{point.avg}")
    )
  )

- From this density plot, it is prominent that the highest density point is around 20,000 for institutions of no religious affiliation or false. Looking back at the interactive visualization, it matches up confirming our analysis. Next, looking at the blue-colored visual, it is prominent that the average is around 25,000 for independent individual income. This can be confirmed by looking at our interactive visualization which is similarly at $25,824 for institutions of religious affiliation.

summaryall%>%
  ggplot() +
  geom_density(aes(x = avg, fill = RELAFFIL),alpha = 0.5, position = "identity")+ 
  facet_wrap(~ RELAFFIL, scales = "free_x")

Riley Payung

#Im specifying only accounting for 4 year students at 100% completion time.
college_4_year_completion <- college %>%
  select(INSTNM,TUITIONFEE_IN,TUITIONFEE_OUT,C100_4_POOLED,REGION) %>%
  mutate(
    REGION = recode(
      REGION,
      "0" = "U.S. Service Schools",
      "1" = "New England Schools",
      "2" = "Mid Eastern Schools",
      "3" = "Great Lakes Schools",
      "4" = "Great Plains Schools",
      "5" = "Southeastern Schools",
      "6" = "Southwestern Schools",
      "7" = "Rocky Mountains Schools",
      "8" = "Far Western Schools",
      "9" = "Outlying Areas Schools"
  ))

college_4_year_completion$TUITIONFEE_IN <- as.numeric(college_4_year_completion$TUITIONFEE_IN)
college_4_year_completion$TUITIONFEE_OUT <- as.numeric(college_4_year_completion$TUITIONFEE_OUT)
college_4_year_completion$C100_4_POOLED <- as.numeric(college_4_year_completion$C100_4_POOLED)

Categories and where they are from:

0 U.S. Service Schools 1 New England (CT, ME, MA, NH, RI, VT) 2 Mid East (DE, DC, MD, NJ, NY, PA) 3 Great Lakes (IL, IN, MI, OH, WI) 4 Great Plains (IA, KS, MN, MO, NE, ND, SD) 5 Southeast (AL, AR, FL, GA, KY, LA, MS, NC, SC, TN, VA, WV) 6 Southwest (AZ, NM, OK, TX) 7 Rocky Mountains (CO, ID, MT, UT, WY) 8 Far West (AK, CA, HI, NV, OR, WA) 9 Outlying Areas (AS, FM, GU, MH, MP, PR, PW, VI)

Exploratory data analysis

Reina Tran

For this project, I will be answering:

  • Is there a relationship between majors (Engineering and History) at colleges and the graduation’s rate?

The variables that I will be using:

  • CIP14BACHL –> Bachelor’s degree in Engineering.

  • CIP54BACHL –> Bachelor’s degree in History.

  • C150_4 –> Completion rate for first-time, full-time students at four-year institutions (150% of expected time to completion)

college_data_tran1 %>%
  gather(CIP14BACHL:CIP54BACHL, key = "variable", value = "value") %>%
  ggplot() + 
    geom_histogram(mapping = aes(x = C150_4), position = "identity", 
                                              bins = 50, alpha = 0.5) +
    facet_wrap(. ~ variable) 
## Warning: Removed 9406 rows containing non-finite values (stat_bin).

college_data_tran1 %>%
ggplot() +
  geom_boxplot(
    mapping = aes(x = C150_4, y = CIP14BACHL))
## Warning: Continuous x aesthetic -- did you forget aes(group=...)?
## Warning: Removed 4703 rows containing missing values (stat_boxplot).

college_data_tran1 %>%
ggplot() +
  geom_boxplot(
    mapping = aes(x = C150_4, y = CIP54BACHL))
## Warning: Continuous x aesthetic -- did you forget aes(group=...)?
## Warning: Removed 4703 rows containing missing values (stat_boxplot).

I picked to explore the variation in my data using the histogram. I picked the histogram because it’s a good representation of the data, it gives me an overall view of what to expect. For this graph, it is showing the graduation’s rate of colleges for the 2 majors that I have picked (Enginerring and History). The distribution of the graph seems to be normal, with a bell-shaped. It does not appear that there are differences between the graduation’s rate within colleges where students are either working toward a bachelor in engineering or history. The box plots also show us that there are no outliers.

Summary Statistics:

college_data_tran1 %>%
summarize(
    mean = mean(CIP14BACHL, na.rm = TRUE),
    median = median(CIP14BACHL, na.rm = TRUE),
    min = min(CIP14BACHL, na.rm = TRUE),
    max = max(CIP14BACHL, na.rm = TRUE),
    sd = sd(CIP14BACHL, na.rm = TRUE),
    iqr = IQR(CIP14BACHL, na.rm = TRUE)
)
mean median min max sd iqr
0.0983805 0 0 2 0.3112709 0
college_data_tran1 %>%
summarize(
    mean = mean(CIP54BACHL, na.rm = TRUE),
    median = median(CIP54BACHL, na.rm = TRUE),
    min = min(CIP54BACHL, na.rm = TRUE),
    max = max(CIP54BACHL, na.rm = TRUE),
    sd = sd(CIP54BACHL, na.rm = TRUE),
    iqr = IQR(CIP54BACHL, na.rm = TRUE)
)
mean median min max sd iqr
0.2044801 0 0 2 0.4256297 0

The mean for Engineering is a lot less compared to the mean for History. Both have about the same statistics for other categories.

Model:

college_data_model <- lm(CIP14BACHL ~ CIP54BACHL, data = college_data_tran1)
college_data_model_df <- college_data_tran1 %>%
  add_predictions(college_data_model) %>%
  add_residuals(college_data_model)
college_data_model_df %>%
gather(CIP14BACHL:CIP54BACHL, key = "variable", value = "value") %>%
ggplot() +
  geom_point(aes(variable, value))
## Warning: Removed 902 rows containing missing values (geom_point).

college_data_model %>%
  tidy()
term estimate std.error statistic p.value
(Intercept) 0.0204519 0.0036263 5.639868 0
CIP54BACHL 0.3811063 0.0076801 49.622822 0
college_data_model %>%
  glance()
r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC deviance df.residual
0.2715682 0.2714579 0.2656843 2462.424 0 2 -616.7015 1239.403 1259.791 466.2347 6605

In conclusion, it does not seem like there is a relationship between majors (Engineering and History) at colleges and the graduation’s rate.

David Bartling

data3_plot <- ggplot(data2) +
geom_histogram(mapping = aes(x = GRAD_DEBT_MDN), binwidth = 5000) +
facet_wrap(~ REGION)

The first histogram shows that the most frequent debt that students in schools have in the southeastern region appear in the 10,000s. It also indicates that there are more schools compared to the southwestern region. In the southwestern region, the histogram shows that the debt is more frequent in the 20’000s, with less counts meaning less schools..

data stats

data3_stats <- data3 %>%
  group_by(REGION) %>%
summarize(
  sum = sum(GRAD_DEBT_MDN),
  mean = mean(GRAD_DEBT_MDN),
  min = min(GRAD_DEBT_MDN),
  max = max(GRAD_DEBT_MDN)
  )

The southeast region has schools that on average have more student debt after graduation then southwestern schools. They also have more overall debt. ### linear model

data3_model <- lm(GRAD_DEBT_MDN ~ REGION, data = data3)

data3_model %>%
  glance()
r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC deviance df.residual
0.0232559 0.0227522 8718.901 46.16685 0 2 -20364.33 40734.67 40751.38 147401292561 1939
data3_model %>%
  tidy()
term estimate std.error statistic p.value
(Intercept) 17274.788 235.8183 73.254657 0
REGIONsw -2946.452 433.6450 -6.794619 0

The r squared value of this model is close to zero, indicating that there is nearly no correlation between GRAD_DEBT_MDN and region. The P-value is very small, indicating that there is a significan relationship between the two variables.

Riley Payung

Question: Between the two tuition rates, in-state and out-of-state, does the graduation rate change based on the tuition rates of colleges, and do higher tuition rate colleges have a lower graduation rate?

For my data analysis, I am using the variables TUITIONFEE_IN, TUITIONFEE_OUT, and C100_4_POOLED which is, respectively, the tution rate of instate students, tuition rate of out of state students, and the completion rate of that college’s students based at 100% completion time, which is 4 years / 8 semesters. Since were not taking into account the students that pay out of state vs the students that pay in-state, we can assume that the graduation rate would remain the same for in-state and out-of-state tuitions in this case.

# Summary Stats of all schools:
summary_in <- college_4_year_completion %>%
  summarise(
    Avg_Cost_InState = mean(TUITIONFEE_IN, na.rm = TRUE),
    Min_Cost_InState = min(TUITIONFEE_IN, na.rm = TRUE),
    Max_Cost_InState = max(TUITIONFEE_IN, na.rm = TRUE),
    Std_Cost_InState = sd(TUITIONFEE_IN, na.rm = TRUE),
    med_Cost_InState = median(TUITIONFEE_IN,na.rm = TRUE), 
    IQR_InState = IQR(TUITIONFEE_IN,na.rm = TRUE)
  )
head(summary_in)
Avg_Cost_InState Min_Cost_InState Max_Cost_InState Std_Cost_InState med_Cost_InState IQR_InState
15120.68 0 74514 12730.86 11375 14166
# Summary Stats of all schools:
summary_out <- college_4_year_completion %>%
  summarise(
    Avg_Cost_OutOfState = mean(TUITIONFEE_IN, na.rm = TRUE),
    Min_Cost_OutOfState = min(TUITIONFEE_IN, na.rm = TRUE),
    Max_Cost_OutOfState = max(TUITIONFEE_IN, na.rm = TRUE),
    Std_Cost_OutOfState = sd(TUITIONFEE_IN, na.rm = TRUE),
    med_Cost_OutState = median(TUITIONFEE_OUT,na.rm = TRUE), 
    IQR_outState = IQR(TUITIONFEE_OUT,na.rm = TRUE)
  )
head(summary_out)
Avg_Cost_OutOfState Min_Cost_OutOfState Max_Cost_OutOfState Std_Cost_OutOfState med_Cost_OutState IQR_outState
15120.68 0 74514 12730.86 15000 15900
# Summary Stats by Region of Instate Tuition Costs.
summary_in_region <- college_4_year_completion %>%
  group_by(REGION) %>%
  summarise(
    Avg_Cost_InState = mean(TUITIONFEE_IN, na.rm = TRUE),
    Min_Cost_InState = min(TUITIONFEE_IN, na.rm = TRUE),
    Max_Cost_InState = max(TUITIONFEE_IN, na.rm = TRUE),
    Std_Cost_InState = sd(TUITIONFEE_IN, na.rm = TRUE),
    med_Cost_InState = median(TUITIONFEE_IN,na.rm = TRUE), 
    IQR_InState = IQR(TUITIONFEE_IN,na.rm = TRUE)
  )
summary_in_region
REGION Avg_Cost_InState Min_Cost_InState Max_Cost_InState Std_Cost_InState med_Cost_InState IQR_InState
Far Western Schools 14359.07 1104 54636 14634.406 8977.0 17255.00
Great Lakes Schools 16177.42 2000 56034 12161.502 12928.5 17901.00
Great Plains Schools 14292.11 480 52782 11061.220 10935.0 14232.75
Mid Eastern Schools 19537.28 2675 57208 14171.316 13944.0 20281.00
New England Schools 24583.56 3570 54770 16629.018 19156.5 28838.75
Outlying Areas Schools 6148.63 1040 13036 2041.274 6472.5 2157.00
Rocky Mountains Schools 10804.64 1835 52818 9175.745 7488.0 11092.00
Southeastern Schools 13130.93 1350 74514 10249.697 11075.0 12417.75
Southwestern Schools 10917.10 932 52498 9966.350 8000.0 12262.50
U.S. Service Schools 255.00 0 1020 510.000 0.0 255.00
# Summary Stats by Region of Out of State Tuition Costs.
summary_out_region <- college_4_year_completion %>%
  group_by(REGION) %>%
  summarise(
    Avg_Cost_OutState = mean(TUITIONFEE_OUT, na.rm = TRUE),
    Min_Cost_OutState = min(TUITIONFEE_OUT, na.rm = TRUE),
    Max_Cost_OutState = max(TUITIONFEE_OUT, na.rm = TRUE),
    Std_Cost_OutState = sd(TUITIONFEE_OUT, na.rm = TRUE),
    med_Cost_OutState = median(TUITIONFEE_OUT,na.rm = TRUE), 
    IQR_outState = IQR(TUITIONFEE_OUT,na.rm = TRUE)
  )
summary_out_region
REGION Avg_Cost_OutState Min_Cost_OutState Max_Cost_OutState Std_Cost_OutState med_Cost_OutState IQR_outState
Far Western Schools 18352.794 2840 54636 13444.171 13438 18115.00
Great Lakes Schools 19442.041 2000 56034 11237.597 16230 16488.00
Great Plains Schools 16095.760 480 52782 10891.502 14263 16537.75
Mid Eastern Schools 21974.130 2675 57208 13083.600 17648 18956.00
New England Schools 28584.000 6306 54770 14415.203 28656 23241.50
Outlying Areas Schools 6630.532 1040 15022 2608.042 6480 2056.50
Rocky Mountains Schools 15710.644 2250 52818 8879.471 14673 10008.50
Southeastern Schools 16523.072 1350 74514 9747.669 14290 11348.00
Southwestern Schools 14480.969 932 52498 9650.328 13438 11910.00
U.S. Service Schools 255.000 0 1020 510.000 0 255.00
# Graduation Rate by Region
summary_grad_rate_region <- college_4_year_completion %>%
  group_by(REGION) %>%
  summarise(
    Avg_Grad_rate = mean(C100_4_POOLED, na.rm = TRUE),
    Std_Grad_rate = sd(C100_4_POOLED, na.rm = TRUE)
  )
summary_grad_rate_region
REGION Avg_Grad_rate Std_Grad_rate
Far Western Schools 0.3589134 0.2422179
Great Lakes Schools 0.3489524 0.2212757
Great Plains Schools 0.3428751 0.2054895
Mid Eastern Schools 0.3925379 0.2477942
New England Schools 0.4927770 0.2364143
Outlying Areas Schools 0.1171596 0.1367742
Rocky Mountains Schools 0.2774934 0.1995911
Southeastern Schools 0.2971091 0.1961184
Southwestern Schools 0.2592190 0.2013747
U.S. Service Schools 0.7963250 0.0812408
summary_grad_rate_region %>%
  hchart(
    "column",
    hcaes(
      x = REGION,
      y = Avg_Grad_rate
    )
  ) %>%
  hc_tooltip(
    useHTML = TRUE,
    pointFormat = tooltip_table(
      x = combine("REGION/CLASSIFICATION:", "Average Graduation Rate:"),
      y = combine("{point.REGION}", "{point.Avg_Grad_rate}")
    )
  )

It is safe to say that we should remove the U.S. Service Schools since they do not charge tuition apart from Merchant Marine Academy; U.S. Service Schools have the highest average graduation rate of 79.63%. This could possibly cause some major outlying in the hypothesis testing. We will do a linear regression with service schools, and without and compare the results.

without_service <- college_4_year_completion %>%
  filter(REGION != "U.S. Service Schools")
with_service <- college_4_year_completion

Linear Model

without_service_model <- lm(TUITIONFEE_IN ~ C100_4_POOLED,data=without_service)

with_service_model <- lm(TUITIONFEE_IN ~ C100_4_POOLED, data=with_service)

without_service_model %>%
  glance()
r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC deviance df.residual
0.4982165 0.4979771 9674.265 2081.1 0 2 -22229.75 44465.5 44482.45 196167589885 2096
without_service_model %>%
  tidy()
term estimate std.error statistic p.value
(Intercept) 6677.646 382.7839 17.44495 0
C100_4_POOLED 42395.900 929.3459 45.61907 0
with_service_model %>%
  glance() %>%
  as_data_frame()
## Warning: `as_data_frame()` is deprecated, use `as_tibble()` (but mind the new semantics).
## This warning is displayed once per session.
r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC deviance df.residual
0.4840636 0.483818 9822.425 1970.27 0 2 -22304.08 44614.16 44631.12 202608075568 2100
with_service_model %>%
  tidy() %>%
  as_data_frame()
term estimate std.error statistic p.value
(Intercept) 6833.177 388.1693 17.60360 0
C100_4_POOLED 41722.214 939.9495 44.38772 0
without_service_model %>%
  glance() %>%
  as_data_frame()
r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC deviance df.residual
0.4982165 0.4979771 9674.265 2081.1 0 2 -22229.75 44465.5 44482.45 196167589885 2096
without_service_model %>%
  tidy() %>%
  as_data_frame()
term estimate std.error statistic p.value
(Intercept) 6677.646 382.7839 17.44495 0
C100_4_POOLED 42395.900 929.3459 45.61907 0
with_service_df <- with_service %>%
  add_residuals(with_service_model) %>%
  add_predictions(with_service_model)
without_service_df <- without_service %>%
  add_residuals(without_service_model) %>%
  add_predictions(without_service_model)
with_service_df %>%
  ggplot() +
  geom_point(aes(x = C100_4_POOLED, y = TUITIONFEE_IN)) +
  geom_line(aes(x = C100_4_POOLED, y = pred), color="red",size=1)
## Warning: Removed 4956 rows containing missing values (geom_point).
## Warning: Removed 4911 rows containing missing values (geom_path).

without_service_df %>%
  ggplot() +
  geom_point(aes(x = C100_4_POOLED, y = TUITIONFEE_IN)) +
  geom_line(aes(x = C100_4_POOLED, y = pred), color="red", size=1)
## Warning: Removed 4954 rows containing missing values (geom_point).
## Warning: Removed 4909 rows containing missing values (geom_path).

with_service_df %>%
  ggplot(mapping = aes(x = resid)) +
  geom_histogram(binwidth = 100)
## Warning: Removed 4956 rows containing non-finite values (stat_bin).

without_service_df %>%
  ggplot(mapping = aes(x = resid)) +
  geom_histogram(binwidth = 100)
## Warning: Removed 4954 rows containing non-finite values (stat_bin).

Comparing the two data frames, one with service schools and one without, we can see that there really is no reason to remove the service schools, except for some minor improvements in the r^2 value.

In conclusion, there really isnt much of a correlation in tuition and graduation rate. Colleges seem to be all over the place, and the r^2 value of ~0.49 is not high enough to really say there is a correlation between tuition rates and graduation rates.

Fahd Saleem

  • The following portion satisfies hypothesis testing requirement.

  • Needed to create the simulated null distribution, in order to quantify how likely it is that a random chance model would generate the experimental result.

summaryall_test <- summaryall %>%
  specify(formula = avg ~ RELAFFIL) %>%
  hypothesize(null = "independence") %>%
  generate(reps = 10000, type = "permute") %>%
  calculate(stat = "diff in means", order = combine("True", "False"))
head(summaryall_test)
replicate stat
1 -12.99704
2 -18.50042
3 575.22292
4 299.64724
5 -96.07602
6 49.64035
  • Needed to compute the experimental result for comparison with the null distribution.
summaryall_test_stat <- summaryall%>%
  specify(formula = avg ~ RELAFFIL)%>%
  calculate(stat = "diff in means", order = combine("True", "False"))
head(summaryall_test_stat)
stat
5484.92
  • The following p-value indicates that there is a failure to reject the null hypothesis. Overall, based on the proceeding simulation-based null distribution, we can conclude that there is an immense or significant difference between the two groups. More in specific, it is prominent that there is a vast difference between the two groups, either religious institutions and non-religious institutions. The significant difference is in regards to the average independent income of students at a particular institution. Since my p-value is less than the value of 0.05, I can conclude that I have sufficient evidence to reject the null hypothesis in favor of the alternative hypothesis.
summaryall_test%>%
  get_p_value(obs_stat = summaryall_test_stat, direction = "both")
p_value
0
  • Given that we have a p-value of zero, it’s evident that there is a significant difference. The simulation-based null distribution visual just goes to show that there indeed is a significant difference between the two groups. Moreover, the red line shows the area of the probability density function, where the null hypothesis would be accepted. Since it is out of bounds or outside the curve, I can conclude that the probability of accepting the null hypothesis is zero.
summaryall_test%>%
  visualize() + 
  shade_p_value(obs_stat = summaryall_test_stat, direction = "both")

  • A crucial aspect of the analysis is to compute the confidence interval for the experimental result. The confidence interval, in essence, represents a plausible range consisting of values for the experimental result based on the variability of the conferred dataset. As shown the code chunk below is running a bootstrap simulation, which is needed since computing the interval requires us to run a bootstrap simulation.
bootstrap_summaryall <- summaryall%>%
  specify(formula = avg ~ RELAFFIL)%>%
  generate(reps = 10000, type = "bootstrap") %>%
  calculate(stat = "diff in means", order = combine("True", "False"))
head(bootstrap_summaryall)
replicate stat
1 5509.379
2 5687.259
3 4793.777
4 6111.076
5 5863.339
6 5185.511
  • The graph below is showing the confidence interval for the experimental result. Moreover, it shows how certain it is for the null distribution to have occurred. Furthermore, it goes to shows that the average null distribution is within the confidence interval. In its entirety, I can conclude that from the confidence interval plot, the actual or true difference is most likely between five thousand or six thousand points. So yes, in conclusion, we can say that there is a significant difference between the averages. There is a disparity between religious and non-religious institutions, and religious institutions do tend to have students with a higher average independent income, meaning that we can reject our null hypothesis. Students of a religious institution tend to be of a higher socioeconomic standing due to the difference in average independent income & our rejection of the null hypothesis favors this. Based on the conferred details of our test, I can conclude that I have sufficient evidence to reject the null hypothesis in favor of the alternative hypothesis.
summaryall_ci <- bootstrap_summaryall %>%
  get_confidence_interval()
bootstrap_summaryall %>%
  visualise() + shade_confidence_interval(summaryall_ci)

  • Overall, based on everything conducted, it seems that there is indeed a vast disparity among the two groups. Initially, I selected the columns I needed to use, most prominently RELAFFIL and IND_INC_AVG. Then the summary statistics portion was conducted. The overall summary statistics for religious and non-religious institutions were performed. Moreover, the summary statistics for each individual religious affiliation was also conducted to be used in an interactive visualization for the end-user to view and compare. In addition to the full visualization, a more specific visual was created, showing just the two bars for the religious versus non-religious institutions. This visual revealed that there, in fact, is a difference between both and that those religious institutions did have students of a higher average income. Moreover, a density plot was made to see the density of the various averages; when compared to the prior visualization, the average incomes lined up with their corresponding religious affiliation, either false or true. Lastly, the hypothesis testing portion was conducted, in which it was found that the probability of accepting the null hypothesis was zero; as a result, we have a failure to accept the null hypothesis. Overall now it is prominent that, in fact, a disparity does exist and that religious institutions have independent students who get a higher average income. This can mean that in general, students of a religious institution have a higher average independent income, making them of a higher socioeconomic standing when compared to independent individuals who attend non religiously affiliated institutions.